Homework 11: Maps and Networks

General instructions for all assignments:

  • Use this file as the template for your submission. Delete the unnecessary text (e.g. this text, the problem statements, etc). That said, keep the nicely formatted “Problem 1”, “Problem 2”, “a.”, “b.”, etc
  • Upload a single R Markdown file (named as: [AndrewID]-HW11.Rmd – e.g. “sventura-HW11.Rmd”) to the Homework 11 submission section on Blackboard. You do not need to upload the .html file.
  • The instructor and TAs will run your .Rmd file on their computer. If your .Rmd file does not knit on our computers, you will be automatically be deducted 10 points.
  • Your file should contain the code to answer each question in its own code block. Your code should produce plots/output that will be automatically embedded in the output (.html) file
  • Each answer must be supported by written statements (unless otherwise specified)
  • Include the name of anyone you collaborated with at the top of the assignment
  • Include the style guide you used below under Problem 0




Easy Problems

Problem 0

(4 points)

Organization, Themes, and HTML Output

  1. For all problems in this assignment, organize your output as follows:
  • Organize each part of each problem into its own tab. For Problems 2 and 5, the organization into tabs is already completed for you, giving you a template that you can use for the subsequent problems.
  • Use code folding for all code. See here for how to do this.
  • Use a floating table of contents.
  • Suppress all warning messages in your output by using warning = FALSE and message = FALSE in every code block.
  1. For all problems in this assignment, adhere to the following guidelines for your ggplot theme and use of color:
  • Don’t use both red and green in the same plot, since a large proportion of the population is red-green colorblind.
  • Try to only use three colors (at most) in your themes. In previous assignments, many students are using different colors for the axes, axis ticks, axis labels, graph titles, grid lines, background, etc. This is unnecessary and only serves to make your graphs more difficult to read. Use a more concise color scheme.
  • Make sure you use a white or gray background (preferably light gray if you use gray).
  • Make sure that titles, labels, and axes are in dark colors, so that they contrast well with the light colored background.
  • Only use color when necessary and when it enhances your graph. For example, if you have a univariate bar chart, there’s no need to color the bars different colors, since this is redundant.
  • In general, try to keep your themes (and written answers) professional. Remember, you should treat these assignments as professional reports.
  1. What style guide are you using for this assignment?


Problem 1

(2 points each)

Read

  1. Read this article. Write 1-3 sentences about what you learned from it.

  2. Read this article. Write 1-3 sentences about what you learned from it.

  3. Read the in-depth description of the ggmap package in the short paper by David Kahle and Hadley Wickham here. Write 1-3 sentences about what you learned from it.

  4. Read the article on ggmap here. Which functions can you use to create geographic heat maps?

  5. Read this description of the ggnetwork package. What does the fortify() function do for networks?



Problem 2

(2 points)

Requesting Final Project Group Members

Fill out the survey here. Even if you don’t have any requests for group members for the final project, please still fill out the survey.

You must fill this out by Tuesday at 5pm in order to receive credit and have your group member requests considered.

We cannot guarantee that any of your final partner requests will be granted.



Problem 3

(2 points each)

Maps with ggmap

Install and load the ggmap package. This package can be used to access maps from Google’s Maps API.

  1. Look at the help documentation for the get_map() function. What does it do? What are the different map sources that can be used in get_map()?

  2. In the help documentation, describe the zoom parameter. Roughly, what would be an appropriate value of this parameter if we wanted to display a square with width 1 mile? (Just a rough estimate is fine; an exact number is not required.)

  3. In the help documentation, what are the different maptype values that can be used? Which of these is unique to Google Maps?

  4. What does the map in the following code block show? Describe it. Explain what each of the parameters in the get_map() and ggmap() functions are doing.

(Note: Before doing this, you may need to install the most updated versions of these packages from GitHub – see commented code below.)

#devtools::install_github('hadley/ggplot2')
#devtools::install_github('thomasp85/ggforce')
#devtools::install_github('thomasp85/ggraph')
#devtools::install_github('slowkow/ggrepel')
library(ggmap)
map_base <- get_map(location = c(lon = -79.944248, lat = 40.4415861),
                    color = "color",
                    source = "google",
                    maptype = "hybrid",
                    zoom = 16)

map_object <- ggmap(map_base,
                    extent = "device",
                    ylab = "Latitude",
                    xlab = "Longitude")
map_object

  1. Recreate the map in part (d). Try changing the zoom parameter to a non-integer value (e.g. 16.5). What happens?

  2. Type class(map_object). What kind of object is your map?



Problem 4

(2 points each)

Finding Latitudes and Longitudes

There are many ways to find latitude and longitude coordinates of specific places. Here’s one easy way:

  1. Go to Google Maps. Type in “times square, nyc” and hit enter. The map should center around New York City. Now, look at the URL in your internet browser. After the @ symbol, the latitude and longitude of the center of the map are displayed (in order). What is the latitude of the map centered on Times Square? What is the longitude?

  2. After the latitude and longitude, the zoom level is displayed (e.g. “17z”). Change this to zoom level 12, and delete any text to the right. This should give you a map that displays most of New York City. Do the latitude/longitude coordinates change when you do this?

  3. Using the code from Problem 3d as a template, create a black and white (color = "bw") map of NYC in R, centered near Times Square, at a zoom level of 12, and with a roadmap map type. Describe the map that is output in R.



Problem 5

(2 points each)

Read Your HW09 Feedback On Blackboard

  1. Write at least one sentence about what you did well on in the assignment.

  2. Write at least one sentence about what you did wrong on the assignment.



Hard Problems

Problem 6

(30 points total)

Mapping US Flights

  1. (5 points) Load the airports and routes datasets from the Lecture 21 R Demo on Blackboard. Use ggmap to create a map centered on the continental United States. Add points corresponding to the location of each airport, sized by the number of arriving flights at that airport.

  2. (15 points) Recreate your plot in (a). This time, manipulate the routes and airports datasets so that you can use geom_segment() to draw a line connecting each airport for each flight listed in the routes dataset. That is, draw a line that connects the departing airport and the arrival airport. Do this only for flights that either depart from or arrive at the Los Angeles airport (LAX).

  3. (5 points) Recreate your plot from (b). Change the lines on your plot so that they are now curved arrows indicating the start and end points of each flight route. See the help documentation for geom_segment() here for how to do this. Do this only for flights that either depart from or arrive at the Los Angeles airport (LAX).

  4. (5 points) Calculate the change in time zones for each flight. When doing this, New York City to Los Angeles should be +3, and Los Angeles to New York City should be -3. Recreate your graph in (c). This time, color the lines by the change in time zones of each flight. Use a color gradient to do this. Be sure to include a detailed legend. Do this only for flights that either depart from or arrive at the Los Angeles airport (LAX).



Problem 7

(4 points each; 32 points total)

Networks, igraph, and ggnetwork

The igraph package makes network analysis easy! Install and load the igraph and igraphdata packages. Load the UKfaculty network into R:

#install.packages("igraph")
#install.packages("igraphdata")
library(igraph)
library(igraphdata)
data(UKfaculty)
  1. Try out some of the built-in functions in the igraph package in order to summarize the UK faculty network. How many nodes (“vertices”) are in the network? (Use vcount().) How many edges (“links”) are in the network? (Use ecount().)

  2. The igraph package has some built-in functions for analyzing specific nodes/vertices in the network. Use the neighbors() function to find both the in-degree and out-degree of the 11th UK faculty member. How many friends does faculty #11 claim to have? How many other faculty members claim that #11 is their friend? (See Professor Rodu’s notes for how to use the neighbors() function.)

  3. Write a function, called get_degree, that calculates the in-degree or out-degree (number of neighbors) that a given node in a network has. Your function should take three inputs: the node index/number, the network itself, and the degree type (in or out). The code is started for you below.

#  node:  The node index in the network
#  network:  The network to use in the degree calculations
#  type:  The degree type; must be either "in" or "out"
get_degree <- function(node, network, type) {
  #  Your code here
}
  1. Write a new function, called get_all_degrees, that calculates the in- or out-degree of an entire network. Your function should call the function you wrote in part (c).
#  network:  The network to use in the degree calculations
#  type:  The degree type; must be either "in" or "out"
get_all_degrees <- function(network, type) {
  #  Your code here
}
  1. Apply your function to the UK faculty network dataset twice – once for each degree type (in or out). Create a new, three-column data.frame that contains the results. One column should contain the in-degrees, one should contain the out-degrees, and a third column should contain the node index/number.

  2. Create a scatterplot of the in-degrees vs. the out-degrees, and change the point type to be their node index/number. Describe the graph. Which nodes are “overconfident” in their popularity (i.e. they claim to have many friends, but not as many others claim that they are friends with this person)? Which nodes are “underconfident” (i.e. many others claim that they are friends with this person, but this person does not claim to have many friends)?

  3. Finally, let’s actually visualize the network itself! Install and load the ggnetwork package. This package is very new – it was just released on March 28th, 2016. You can read more about how to use it in the article from Problem 1. Visualize the UK faculty network. The code below should get you started (be sure to add a title, remove the x- and y-axis labels and tick marks, and adjust the legend as necessary). Feel free to further adjust the graph as you see fit (not required) – see the article at the link above for more ideas.

Note that you’ll also need to install and load the intergraph package. Finally, depending on which version of ggplot2, R, and RStudio you’re running (new versions of all three were just released), you may need to update your packages to the latest versions that are on GitHub (see the first commented line of code below):

#devtools::install_github("briatte/ggnetwork")
#devtools::install_github("mbojan/intergraph")
library(ggnetwork)

uk_data <- fortify(UKfaculty)
node_degrees <- igraph::degree(UKfaculty, mode = "in")
uk_degrees <- node_degrees[match(uk_data$vertex.names, 1:length(node_degrees))]
uk_data$degree <- uk_degrees
ggplot(uk_data, 
       aes(x, y, xend = xend, yend = yend)) +
  geom_edges(arrow = arrow(length = unit(0.3, "lines")), 
             aes(color = as.factor(Group)), alpha = 0.5) +
  geom_nodetext(aes(label = vertex.names, size = degree * 1.5), 
                color = "blue", fontface = "bold")

  1. Describe the network. How many groups (“cliques”) do there appear to be? In the above graph, to what does the size of the nodes correspond? To what does the color of the edges correspond? Is the graph a directed or an undirected graph?


Bonus Problems

Bonus 1

(15 points)

Create geom_acf()

Many of you didn’t have time to complete this last week, so you get another shot this week. If you already turned in something for this problem, you can resubmit an updated version, but you will only receive bonus credit for this assignment (not for a previous assignment).


Create a new ggplot() geometry called geom_acf() that will create the same graph you made in Problem 7 of Homework 09, but in a more typical ggplot() way.

You should first read this vignette from Hadley Wickham on extending the ggplot2 package, where he demonstrates how to create new geoms and stats.

Write two functions:

  • fortify(): This function should take two inputs – a data.frame (e.g. trips_per_day) and a column name specifying the column of the data.frame that contains the time series data (e.g. “n_trips”). It should output a data.frame with two columns – lag and ac. Lag should contain the lag ID number, and ac should contain the autocorrelation of the time series at that lag.
  • geom_acf() (and, if necessary, StatACF())

Demonstrate that the following code works once you write these functions:

  • ggplot(fortify(trips_per_day, n_trips), aes(x = lag, y = ac)) + geom_acf()
  • ggplot(fortify(trips_per_day, n_trips)) + geom_acf(aes(x = lag, y = ac))


Bonus 2

(15 points)

Airport Network

Your task in this problem is to create a network diagram of the airport / flight data from Problem 6. To do this, take the following steps:

  1. Create an adjacency matrix for the top 250 airports in the dataset (where “top” is defined as the airports with the most total incoming and outgoing flights), and store it in an object called airport_adj. You should create the matrix so that airport_adj[ii,jj] is proportional to the total number of flights between airport ii and airport jj in the dataset.

  2. Use the graph_from_adjacency_matrix() function in the igraph package to convert your adjacency matrix to an object of class igraph.

  3. Use an approach similar to what you did in Problem 7 to create a ggplot() network diagram of the airports data. In your graph, do all of the following:

  • Give the graph an appropriate title, and remove all axis elements (axes, labels, tick marks, etc)
  • Size the edges by the frequency of flights between the two nodes (airports)
  • Color the nodes by their daylight savings time category, and include an appropriate legend; alternatively, color by continent for an extra bonus point
  • Change the line type of the edges so that domestic flights (flights between two airports in the same country) solid lines and international flights (flights between two airports in different countries) have dotted or dashed lines, and include an appropriate legend
  1. Describe the resulting network diagram. Point out any interesting features you see.